8 research outputs found

    GPOP: A cache- and work-efficient framework for Graph Processing Over Partitions

    The past decade has seen the development of many shared-memory graph processing frameworks intended to reduce the effort of developing high-performance parallel applications. However, many of these frameworks, based on vertex-centric or edge-centric paradigms, suffer from several issues, such as poor cache utilization, irregular memory accesses, heavy use of synchronization primitives, and theoretical inefficiency, that deteriorate overall performance and scalability. Recently, we proposed a cache- and memory-efficient partition-centric paradigm for computing PageRank. In this paper, we generalize this approach to develop a novel Graph Processing Over Partitions (GPOP) framework that is cache-efficient, scalable, and work-efficient. GPOP induces locality in memory accesses by increasing the granularity of execution to vertex subsets called 'partitions', thereby dramatically improving the cache performance of a variety of graph algorithms. It achieves high scalability by enabling completely lock- and atomic-free computation. GPOP's built-in analytical performance model enables it to use a hybrid of source-centric and partition-centric communication modes in a way that ensures work efficiency in each iteration, while simultaneously exploiting high-bandwidth sequential memory accesses. We extensively evaluate the performance of GPOP for a variety of graph algorithms, using several large datasets. We observe that GPOP incurs up to 9x, 6.8x, and 5.5x fewer L2 cache misses compared to Ligra, GraphMat, and Galois, respectively. In terms of execution time, GPOP is up to 19x, 9.3x, and 3.6x faster than Ligra, GraphMat, and Galois, respectively.
    Comment: 23 pages, 7 figures, 4 tables
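    To make the scatter-gather pattern concrete, here is a minimal Python sketch of partition-centric PageRank, assuming a simplified data layout of my own (vertex-list partitions, an `owner` map, and per-partition message bins); it illustrates the idea, not GPOP's actual API.

```python
def pagerank_partition_centric(partitions, out_edges, owner, n,
                               d=0.85, iters=20):
    """partitions: list of vertex lists (each sized to fit in cache)
    out_edges[u]: destinations of u; owner[v]: partition id of v."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Scatter: each contribution goes into the bin of the partition
        # that owns the destination vertex.
        bins = [[] for _ in partitions]
        for part in partitions:
            for u in part:
                if not out_edges[u]:
                    continue
                c = d * rank[u] / len(out_edges[u])
                for v in out_edges[u]:
                    bins[owner[v]].append((v, c))
        # Gather: each partition reduces only its own bin, so writes
        # stay within one cache-sized vertex range and are private to
        # the partition -- lock- and atomic-free when parallelized.
        rank = [(1.0 - d) / n] * n
        for p in range(len(partitions)):
            for v, c in bins[p]:
                rank[v] += c
    return rank
```

    Binning trades extra sequential writes during the scatter for cache-resident random accesses during the gather, which is the locality effect the abstract describes.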

    PolarStar: Expanding the Scalability Horizon of Diameter-3 Networks

    In this paper, we present PolarStar, a novel family of diameter-3 network topologies derived from the star product of two low-diameter factor graphs. The proposed PolarStar construction gives the largest known diameter-3 network topologies for almost all radixes. Compared to state-of-the-art diameter-3 networks, PolarStar achieves a 31% geometric-mean increase in scale over Bundlefly, 91% over Dragonfly, and 690% over 3-D HyperX. PolarStar has many other desirable properties, including a modular layout, large bisection, high resilience to link failures, and a large number of feasible sizes for every radix. Our evaluation shows that it exhibits comparable or better performance than other diameter-3 networks under various traffic patterns.
    Comment: 13 pages, 13 figures, 4 tables
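    For intuition about the star product itself, here is a hedged Python sketch over adjacency-dict graphs. It uses the identity map on inter-supernode edges, so it degenerates to the Cartesian product; the real star product (and PolarStar's diameter-3 guarantee) comes from choosing a suitable bijection per G1 edge, which is omitted here.

```python
import itertools

def star_product(G1, G2):
    """Simplified product of two undirected graphs ({node: neighbor_set}).
    Product nodes are pairs (a, x):
      - (a, x) ~ (a, y) when x ~ y in G2 (edges inside a supernode);
      - (a, x) ~ (b, x) when a ~ b in G1 (edges between supernodes,
        identity matching -- the actual construction varies this map).
    """
    adj = {(a, x): set() for a, x in itertools.product(G1, G2)}
    for (a, x) in adj:
        for y in G2[x]:
            adj[(a, x)].add((a, y))
        for b in G1[a]:
            adj[(a, x)].add((b, x))
    return adj

# Example: triangle * single edge -> the 3-prism (6 nodes, degree 3).
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
edge = {'u': {'v'}, 'v': {'u'}}
prism = star_product(triangle, edge)
```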

    SPEC2: SPECtral SParsE CNN Accelerator on FPGAs

    To accelerate inference of Convolutional Neural Networks (CNNs), various techniques have been proposed to reduce computation redundancy. Converting convolutional layers into the frequency domain significantly reduces the computational complexity of the sliding-window operations in the space domain. On the other hand, weight pruning techniques address the redundancy in model parameters by converting dense convolutional kernels into sparse ones. To obtain a high-throughput FPGA implementation, we propose SPEC2 -- the first work to prune and accelerate spectral CNNs. First, we propose a systematic pruning algorithm based on the Alternating Direction Method of Multipliers (ADMM). The offline pruning iteratively sets the majority of spectral weights to zero, without using any handcrafted heuristics. Then, we design an optimized pipeline architecture on FPGA that has efficient random access into the sparse kernels and exploits various dimensions of parallelism in convolutional layers. Overall, SPEC2 achieves high inference throughput with extremely low computational complexity and negligible accuracy degradation. We demonstrate SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy loss for LeNet and <1% accuracy loss for VGG16. The resulting accelerators achieve up to 24x higher throughput compared with state-of-the-art FPGA implementations of VGG16.
    Comment: This is a 10-page conference paper at the 26th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC).
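    As background on the pruning step, here is a toy NumPy sketch of ADMM-based magnitude pruning: the weight update follows the task gradient plus a quadratic penalty pulling W toward a sparse copy Z, the Z update is a Euclidean projection onto the sparsity constraint (keep the largest-magnitude entries), and the dual variable accumulates the gap. The loss, schedules, and the paper's spectral (FFT-domain) setting are abstracted away; `loss_grad`, `rho`, `lr`, and `keep_ratio` are illustrative assumptions.

```python
import numpy as np

def project_sparse(W, keep_ratio):
    """Projection onto {at most k nonzeros}: keep top-k magnitudes."""
    k = max(1, int(keep_ratio * W.size))
    Z = np.zeros_like(W)
    idx = np.unravel_index(np.argsort(np.abs(W), axis=None)[-k:], W.shape)
    Z[idx] = W[idx]
    return Z

def admm_prune(W, loss_grad, keep_ratio=0.25, rho=1e-3, lr=1e-2, steps=100):
    Z = project_sparse(W, keep_ratio)   # sparse auxiliary variable
    U = np.zeros_like(W)                # scaled dual variable
    for _ in range(steps):
        # W-step: gradient step on loss(W) + (rho/2) * ||W - Z + U||^2
        W = W - lr * (loss_grad(W) + rho * (W - Z + U))
        Z = project_sparse(W + U, keep_ratio)   # Z-step: projection
        U = U + W - Z                           # dual ascent on the gap
    return project_sparse(W, keep_ratio)        # final hard prune
```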

    Quickly Finding a Truss in a Haystack

    The k-truss of a graph is a subgraph such that each edge is tightly connected to the remaining elements in the k-truss; it can also represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast to finding other subgraphs such as cliques. While there are numerous formulations and algorithms for finding the maximal k-truss of a graph, many of these tend to be computationally expensive and do not scale well. Many algorithms are iterative and rerun static-graph triangle counting in each iteration. In this work we present a novel algorithm for finding both the k-truss of a graph (for a given k) and the maximal k-truss, using a dynamic graph formulation. Our algorithm has two main benefits, illustrated by the sketch below. 1) Unlike many algorithms that rerun static triangle counting after the removal of nonconforming edges, we use a new dynamic graph formulation that only requires updating the edges affected by each removal. As our updates are local, we do only a fraction of the work of other algorithms. 2) Our algorithm is extremely scalable and is able to detect deleted triangles concurrently, in contrast to past sequential approaches. While our algorithm is architecture-independent, we show a CUDA-based implementation for NVIDIA GPUs. In numerous instances, our new algorithm is anywhere from 100X to 10000X faster than the Graph Challenge benchmark. Furthermore, our algorithm shows significant speedups, in some cases over 70X, over a recently developed sequential and highly optimized algorithm.
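    The following sequential Python sketch shows the peeling-with-local-updates idea (benefit 1): when an edge is deleted, only the supports of edges that shared a triangle with it are touched. The concurrent, GPU-side triangle detection behind benefit 2 is not modeled, and the data representation is my own.

```python
from collections import deque

def k_truss_edges(adj, k):
    """Return the edges of the k-truss: every surviving edge must lie
    in at least k-2 triangles. adj: {vertex: iterable of neighbors}."""
    adj = {u: set(vs) for u, vs in adj.items()}
    sup = {(u, v): len(adj[u] & adj[v])      # triangles per edge
           for u in adj for v in adj[u] if u < v}
    q = deque(e for e, s in sup.items() if s < k - 2)
    while q:
        u, v = q.popleft()
        if v not in adj[u]:                  # already deleted
            continue
        # Deleting (u,v) breaks one triangle per common neighbor w,
        # so only edges (u,w) and (v,w) need their support updated.
        for w in adj[u] & adj[v]:
            for e in ((min(u, w), max(u, w)), (min(v, w), max(v, w))):
                sup[e] -= 1
                if sup[e] == k - 3:          # just fell below k - 2
                    q.append(e)
        adj[u].discard(v)
        adj[v].discard(u)
    return {e for e in sup if e[1] in adj[e[0]]}

# Example: a 4-clique plus a pendant edge; the 4-truss is the clique.
g = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
print(k_truss_edges(g, 4))   # the six edges among vertices 0..3
```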

    A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network

    Novel low-diameter network topologies such as Slim Fly (SF) offer significant cost and power advantages over the established Fat Tree, Clos, or Dragonfly topologies. To spearhead the adoption of low-diameter networks, we design, implement, deploy, and evaluate the first real-world SF installation. We focus on the deployment, management, and operational aspects of our test cluster with 200 servers and carefully analyze performance. We demonstrate techniques for simple cabling and cabling validation, as well as a novel high-performance routing architecture for InfiniBand-based low-diameter topologies. Our real-world benchmarks show SF's strong performance for many modern workloads such as deep neural network training, graph analytics, and linear algebra kernels. SF outperforms non-blocking Fat Trees in scalability while offering comparable or better performance and lower cost at large network sizes. Our work can facilitate deploying SF, and the associated (open-source) routing architecture is fully portable and applicable to accelerating any low-diameter interconnect.
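    The abstract's claims rest on the Slim Fly topology itself; for readers unfamiliar with it, here is a toy construction of the underlying MMS graph for a prime q = 4w + 1. The general design uses Galois fields and other residue classes, and real deployments add endpoint attachment and routing, so everything below is an illustrative assumption rather than the deployed system.

```python
def slim_fly_mms(q):
    """Toy MMS graph behind Slim Fly for a prime q with q % 4 == 1.
    2*q^2 routers (s, a, b), s in {0, 1}; the result has diameter 2
    and network radix (3q - 1) / 2."""
    assert q % 4 == 1, "sketch assumes a prime q = 4w + 1"
    X = {pow(i, 2, q) for i in range(1, q)}   # quadratic residues
    Xp = set(range(1, q)) - X                 # non-residues
    V = [(s, a, b) for s in (0, 1) for a in range(q) for b in range(q)]
    adj = {v: set() for v in V}
    for (s, a, b) in V:                       # intra-group edges:
        for d in (X if s == 0 else Xp):       # same a, offset in X/X'
            adj[(s, a, b)].add((s, a, (b + d) % q))
    for x in range(q):                        # inter-group edges:
        for m in range(q):                    # (0,x,y) ~ (1,m,c)
            for c in range(q):                # iff y = m*x + c (mod q)
                y = (m * x + c) % q
                adj[(0, x, y)].add((1, m, c))
                adj[(1, m, c)].add((0, x, y))
    return adj

# q = 5: 50 routers, each of radix 7.
g = slim_fly_mms(5)
assert all(len(nbrs) == 7 for nbrs in g.values())
```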

    In-network Allreduce with Multiple Spanning Trees on PolarFly

    Allreduce is a fundamental collective used in parallel computing and in distributed training of machine learning models, and it can become a performance bottleneck on large systems. In-network computing improves Allreduce performance by reducing packets on the fly using network routers. However, the throughput of current in-network solutions is limited to a single link's bandwidth. We develop, compare, and contrast two different sets of Allreduce spanning trees embedded into PolarFly, a high-performance diameter-2 network topology. Both of our solutions offer theoretically guaranteed near-optimal performance, boosting Allreduce bandwidth by a factor equal to half the network radix of the nodes. While the first set offers low latency with trees of depth 3, the second set offers a congestion-free implementation that reduces the complexity and resource requirements of in-network computing units. In doing so, we also distinguish PolarFly as a highly suitable network for distributed deep learning and other applications that employ large, throughput-bound Allreduce operations.
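    A toy Python model of the multi-tree idea: the buffer is split into one chunk per spanning tree, and each tree independently reduces its chunk to the root and broadcasts the result back, so with edge-disjoint trees the chunks move in parallel and aggregate bandwidth scales with the tree count. Constructing such trees on PolarFly is the paper's contribution; this sketch only models the data movement, and all names are illustrative.

```python
def tree_allreduce(values, children, root):
    """Reduce-then-broadcast of one chunk over one spanning tree.
    children: {node: list of child nodes}; values: {node: number}."""
    def reduce_up(node):
        return values[node] + sum(reduce_up(c) for c in children.get(node, []))
    total = reduce_up(root)            # reduce toward the root ...
    return {v: total for v in values}  # ... then broadcast back down

def multitree_allreduce(buffers, trees):
    """buffers: {node: list with one chunk per tree};
    trees: list of (children, root) spanning trees."""
    out = {v: [0] * len(trees) for v in buffers}
    for t, (children, root) in enumerate(trees):
        chunk = {v: buf[t] for v, buf in buffers.items()}
        for v, total in tree_allreduce(chunk, children, root).items():
            out[v][t] = total
    return out

# Two trees over 4 nodes; each node contributes [its id, its id * 10].
trees = [({0: [1, 2], 2: [3]}, 0), ({3: [1, 2], 2: [0]}, 3)]
bufs = {v: [v, 10 * v] for v in range(4)}
print(multitree_allreduce(bufs, trees))  # every node ends with [6, 60]
```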